How I lost 1000€ betting on CS:GO

With machine learning and Python

Author

Pedro Tabacof

Published

November 11, 2023

PyCon 2023

PyCon 2023 slides

Intercom is hiring for the ML team in Dublin! We are hiring engineers and scientists at senior and staff levels. Check out our careers page.

Blog post (WiP)

This is a true story of how I lost money using machine learning to bet on CS:GO. The original idea and implementation came from a friend, who gave me permission to share this story in public. Just to make sure I got it right, since this took place in 2019, I retrieved my bets from the betting website to plot my cumulative returns:

As you can see, I lost a lot of money very quickly, then my losses plateaued until I grew bored and decided to cut my losses. My friend persisted and ended up making a 7.5% return on investment (ROI)! We did have an edge after all, and I squandered it. This post is a lesson on how to bet with ML and on how to avoid my mistakes. It is divided into two parts.

First, I go over the fundamentals (theory):

  • What is your edge?
  • Financial decision-making with ML
  • Probability calibration
  • TrueSkill: inferential vs predictive models
  • How much to bet?
  • Winner’s curse

Then, I go over our actual solution (practice):

  • Data scraping
  • Feature engineering
  • Modelling
  • Evaluation
  • Backtesting
  • Why I lost 1000 euros

Part 1: Foundations

What is your edge?

If you’re playing a poker game and you look around the table and you can’t tell who the sucker is, it’s you.

If you want to bet or trade, you need to have an edge. While the efficient market hypothesis is a good approximation, there are clearly situations where you can exploit market inefficiencies for monetary gains. That is your edge. That is how funds like Renaissance Technologies make money over decades. For a clear first-person narrative on finding edges and beating casinos and markets, read Ed Thorp’s biography.

Watch out for anecdotal evidence of beating the market: it could be due solely to selection and survivorship bias. Edges are not sold on the public market, so be skeptical of anyone trying to sell you a way to make money. Whenever an edge becomes public, it stops being an edge. And that is why you should read failure stories like mine: they carry important lessons not contaminated by bad incentives or biases.

Machine learning for financial decision-making

How can you use machine learning (ML) for sports betting? That has a surprisingly long history: decades ago, logistic regression was already being used for horse race betting and made some people rich. Essentially, it’s no different from using ML in any other financial decision-making, such as giving loans or stopping fraudulent transactions.

Let’s work backwards from a simple profits equation. Profits can be defined as revenue minus costs, both functions of the actions a you take: betting, providing credit, or blocking transactions. You can write it simply as:

\[ \text{Profits}[a] = \sum \text{Revenue}(a) - \sum \text{Cost}(a) \]

While simple and naive, defining the revenue and costs for your product or business can be an illuminating exercise; it is what differentiates “Kaggle” from “business” data scientists. For more on this topic, read my blog post The Hierarchy of Machine Learning Needs. Let’s define the action, revenue, and cost for the aforementioned examples:

  • Betting:
    • Action: Bet size (between 0 and the limit offered by the betting house)
    • Revenue: Bet size * Odds * Probability of winning
    • Cost: Bet size
  • Loans:
    • Action: Loan amount (between 0 and some limit imposed by the bank)
    • Revenue: Interest rate * loan amount
    • Cost: Probability of default * loan amount
  • Fraud:
    • Action: Blocking the transaction or not (binary)
    • Revenue: Transaction amount * Margins
    • Cost: Transaction amount * Probability of fraud

ML can help by filling in the missing quantities (in bold): predicting the probability of winning a bet, default of a loan or fraud for a transaction. With such probabilities, you can choose an action a that maximizes profits. Let’s work that out in the case of betting.
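To make the table above concrete, here is a minimal sketch of the three expected-profit calculations. All numbers and function names are made up for illustration; they are not from our actual system:

```python
def bet_expected_profit(bet_size: float, odds: float, p_win: float) -> float:
    """Expected profit of a bet at fractional odds: win revenue minus expected stake loss."""
    return bet_size * odds * p_win - bet_size * (1.0 - p_win)

def loan_expected_profit(amount: float, interest_rate: float, p_default: float) -> float:
    """Expected profit of a loan: interest revenue minus expected default loss."""
    return interest_rate * amount - p_default * amount

def transaction_expected_profit(amount: float, margin: float, p_fraud: float, block: bool) -> float:
    """Expected profit of a transaction: zero if blocked, margin minus fraud risk otherwise."""
    if block:
        return 0.0
    return amount * margin - amount * p_fraud

# Hypothetical numbers, purely for illustration
print(bet_expected_profit(10, 2, 0.4))            # 10*2*0.4 - 10*0.6
print(loan_expected_profit(1000, 0.05, 0.02))     # 50 - 20
print(transaction_expected_profit(100, 0.03, 0.1, block=True))
```

In each case, the ML model supplies the probability and the action is chosen to maximize the expectation.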

Betting decision rule

Say you have a match between two teams. Assume that the probability of A beating B is P(X_A, X_B), where X are the features of each team (say, their latest result).

If you place a bet of \$10 at odds of 2, you will net \$20 if A wins and lose \$10 if A loses. The expected value of the bet is:

\[ E[bet=10] = 20 \cdot P(X_A, X_B) - 10 \cdot (1 - P(X_A, X_B)) = 30 \cdot P(X_A, X_B) - 10 \]

More broadly:

\[ E[bet] = E[Revenue]-E[Loss] = \text{Bet}*\text{Odds}*\text{Probability of Winning}- \text{Bet}*(1-\text{Probability of Winning}) \]

This leads to the following decision rule: if \(P(X_A, X_B) \gg 1/3\), you should bet on A. If \(P(X_A, X_B) \ll 1/3\), you should bet on B. If \(P(X_A, X_B) \approx 1/3\), you should not bet at all. The “much greater/lesser” inequalities account for inefficiencies in the betting process, such as transaction fees, and for the ever-present model prediction errors.
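A sketch of this decision rule for the \$10-at-odds-2 example. The 0.05 buffer is an arbitrary stand-in for fees and model error, not a tuned value:

```python
def betting_decision(p_a_wins: float, buffer: float = 0.05) -> str:
    """Decide which side to bet at fractional odds of 2, where break-even is P = 1/3.

    The buffer widens the no-bet zone to account for fees and model error.
    """
    breakeven = 1.0 / 3.0
    if p_a_wins > breakeven + buffer:
        return "bet on A"
    if p_a_wins < breakeven - buffer:
        return "bet on B"
    return "no bet"

print(betting_decision(0.60))  # comfortably above 1/3: bet on A
print(betting_decision(0.20))  # comfortably below 1/3: bet on B
print(betting_decision(0.35))  # too close to call: no bet
```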

Now, one bet cannot be made in isolation. There are other potential bets you can make in the present and future. For simplicity, I will assume bets are made sequentially and on a limited bankroll. How do you maximize the expected value of all your bets?

Kelly criterion

The Kelly criterion is a formula used to determine the optimal size of a series of bets, developed by John Kelly in 1956. It’s based on the idea of maximizing the expected value of the logarithm of wealth. I will not bore you with the derivation here, but the end result is a simple formula giving the fraction of your bankroll to wager:

\[ \text{Fraction} = \frac{\text{Odds} * \text{Probability of Winning} - (1 - \text{Probability of Winning})}{\text{Odds}} \]

Let’s go back to the first example and assume the probability of winning to be 0.6. The optimal fraction of the bankroll to wager is:

\[ \text{Fraction} = \frac{2*0.6 - 0.4}{2} = 0.4 \]

That is, you should bet 40% of your bankroll! That sounds suspiciously large, given that you only have a 60% chance of winning. Unfortunately, the Kelly criterion is not actually reliable for real decision-making:

  • It assumes the probabilities are known exactly, with no uncertainty
  • It is far too aggressive and allows for too much variance in the bankroll

The best way to understand why the Kelly criterion is too aggressive is by looking at the following chart, which shows the relationship between the bankroll growth rate and fraction of bankroll wagered:

Code
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

# Parameters: a fair coin flip at fractional odds of 2
p = 0.5        # Probability of winning
q = 1.0 - p    # Probability of losing
b = 2          # Fractional odds (profit per unit wagered)
a = 1          # Fraction of the wager lost on a loss

# Expected log-growth rate of the bankroll as a function of the wagered fraction f
def growth_rate(f, p, q, b, a):
    return p * np.log(1 + b*f) + q * np.log(1 - a*f)

# Generate data points for plotting
f_values = np.linspace(0, 0.75, 500)  # Wagered fraction values from 0% to 75%
r_values = growth_rate(f_values, p, q, b, a)

# Using plotly express to create the base plot
fig = px.line(x=f_values, y=r_values, labels={'x': 'Wagered fraction', 'y': 'Growth rate'},
              title="Growth rate as a function of wagered fraction")

# Formatting axes to show percentages
fig.update_layout(
    width=600, 
    height=600,
    xaxis_tickformat=",.0%",  # x-axis percentages
    yaxis_tickformat=",.1%"   # y-axis percentages
)

# Adding the vertical line
fig.add_shape(
    go.layout.Shape(
        type="line",
        x0=0.25,
        y0=0,
        x1=0.25,
        y1=0.059,
        line=dict(color="Red", dash="dot")
    )
)

# Annotations
fig.add_annotation(
    go.layout.Annotation(
        text='Optimum "Kelly" bet',
        xref="x",
        yref="y",
        x=0.25,
        y=0.059,
        showarrow=True,
        arrowhead=4,
        ax=60,
        ay=-40
    )
)

# Adding parameters as a footnote
param_str = f"Parameters: P_winning={p}, Odds={b}"
fig.add_annotation(
    go.layout.Annotation(
        text=param_str,
        xref="paper",
        yref="paper",
        x=0,
        y=-0.15,
        showarrow=False,
        align="left",
        font=dict(size=10)
    )
)

fig.show()

Note the asymmetry: if you bet too little, you will not grow your bankroll fast enough. But if you bet too much, you will lose your bankroll very quickly. The Kelly criterion is the point where the growth rate is maximized, but the curve is flat around that point, which means that you can bet less and still get a high growth rate without the risk associated with the “optimal betting point”.

This is more general than it seems: when I worked in performance marketing, the curve between profits and marketing spend was surprisingly similar. My former colleague wrote about it here: Forecasting Customer Lifetime Value - Why Uncertainty Matters.

How much should you bet, then? A simple heuristic is to always bet a fixed amount, no matter what. Another is the half-Kelly: betting half of what the Kelly criterion implies, a good compromise between the two extremes.
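Both heuristics are one-liners. A sketch, using the fractional-odds Kelly formula from above (the fixed stake of 10 units is an arbitrary example):

```python
def kelly_fraction(odds: float, p_win: float) -> float:
    """Kelly fraction for fractional odds: (odds * p - (1 - p)) / odds, floored at zero."""
    edge = odds * p_win - (1.0 - p_win)
    return max(edge / odds, 0.0)

def half_kelly(odds: float, p_win: float) -> float:
    """Half of the Kelly fraction: far less variance for a small cost in growth rate."""
    return kelly_fraction(odds, p_win) / 2.0

def fixed_stake(bankroll: float, stake: float = 10.0) -> float:
    """The simplest heuristic: always bet the same amount (capped by the bankroll)."""
    return min(stake, bankroll)

print(kelly_fraction(2, 0.6))  # roughly 0.4 of the bankroll
print(half_kelly(2, 0.6))      # roughly 0.2 of the bankroll
```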

Probability calibration

A lot of the assumptions above are not true in real life. For example, the probability of a team winning is not known; it can only be estimated from historical data. And even a model that predicts winners well does not necessarily output calibrated probabilities.

Before we go further, we need to define what probability calibration means: a model is calibrated if, among all events to which it assigns probability p, a fraction p actually occur. A calibration plot compares predicted probabilities against observed frequencies:

Code
import plotly.graph_objects as go

# Example calibration data for a miscalibrated model
mean_predicted_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
fraction_of_positives = [.05, 0.1, 0.2, .35, 0.5, .65, 0.8, 0.9, 0.95]

# Create a scatter plot for the model's calibration
fig = go.Figure()

# Add the perfectly calibrated line
fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines', name='Perfectly calibrated',
    line=dict(dash='dot')
))

# Add the model's calibration curve
fig.add_trace(go.Scatter(
    x=mean_predicted_values, y=fraction_of_positives,
    mode='lines+markers', name='Actual model',
    line=dict(color='black')
))

# Set layout properties
fig.update_layout(
    title="Calibration plot",
    xaxis_title="Mean predicted value",
    yaxis_title="Fraction of positives",
    xaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    yaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    showlegend=True
)

fig.show()

Note that calibration is not enough: if your model predicts 50% for a coin flip, it’s perfectly calibrated but also totally useless!

How do you calibrate a model? If your loss function is a proper scoring rule, like the negative log-likelihood (NLL), the model tends to come out calibrated by default. But if your loss doesn’t lead to calibration, or you empirically observe calibration issues, you can always apply a calibration method:

  • Platt scaling: logistic regression on the output of the model
  • Isotonic regression: non-parametric method that fits a piecewise-constant, non-decreasing function to the data
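Both methods are available in scikit-learn via `CalibratedClassifierCV`. A minimal sketch on synthetic data; Gaussian Naive Bayes is just a stand-in for a miscalibrated base model, not part of our actual pipeline:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
# Platt scaling: a logistic regression fitted on the model's outputs via internal CV
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=3).fit(X_train, y_train)
# Isotonic regression: a non-decreasing step function fitted on the model's outputs
iso = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=3).fit(X_train, y_train)

for name, clf in [("raw", raw), ("platt", platt), ("isotonic", iso)]:
    proba = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier = {brier_score_loss(y_test, proba):.4f}")  # lower is better
```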

How do you measure if your model is calibrated? I suggest a two-pronged approach:

  1. Calibration plot (like the one shown above): visual inspection always helps in making sense of the data
  2. Brier score / negative log-likelihood: both are proper scoring rules balancing calibration and accuracy, so lower is better
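Both checks are one-liners with scikit-learn. A sketch on synthetic scores that are calibrated by construction (the outcomes are drawn at exactly the predicted rates):

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, 10_000)                          # hypothetical model scores
y_true = (rng.uniform(0, 1, 10_000) < y_prob).astype(int)   # outcomes drawn at those rates

# 1. Calibration plot inputs (note the return order: fraction of positives first)
fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_prob, n_bins=10)

# 2. Proper scoring rules: lower is better
print(f"Brier score: {brier_score_loss(y_true, y_prob):.3f}")
print(f"NLL: {log_loss(y_true, y_prob):.3f}")
```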

Winner’s curse

If it sounds too good to be true, it probably is!

The winner’s curse is a phenomenon in which the winner of an auction tends to overpay. Consider the following thought experiment: N people bid for a product whose true value is Value, and each person estimates that value as \(v \sim \mathcal{N}(\text{Value}, \sigma)\), where \(\mathcal{N}\) is a normal distribution and \(\sigma\) is the standard deviation of the estimate. The highest bidder wins. Assuming a second-price auction, what is the expected value of the winning bid?

Let’s assume the item is worth \$1000, the standard deviation of the estimates is \$50, and there are 10 bidders. In a second-price auction, the dominant strategy is to bid your estimate of the item’s value; if you win, you only pay the second-highest bid. We can simulate that easily:

Code
import numpy as np
import plotly.graph_objects as go

# Winner's curse simulation
V = 1000    # Real value of the item
N = 10      # Number of bidders
sigma = 50  # Standard deviation of the bidders' valuations

overpayment = []
for _ in range(10_000): # Simulation samples
    v = np.random.normal(V, sigma, N)
    price_paid = np.sort(v)[-2]  # The winner pays the second-highest bid
    overpayment.append(price_paid - V)
mean_overpayment = np.mean(overpayment)

# Create the histogram
fig = go.Figure()

fig.add_trace(go.Histogram(
    x=overpayment,
    name='Overpayment',
    marker=dict(color='blue'),
    opacity=0.5,
))


# Add annotations and labels
fig.update_layout(
    title="Winner's Curse: Histogram of Overpayment",
    xaxis_title='Overpayment ($)',
    yaxis_title='Frequency',
    shapes=[dict(
        x0=mean_overpayment,
        x1=mean_overpayment,
        y0=0,
        y1=1,
        yref='paper',
        line=dict(color='red', width=2, dash='dot')
    )],
    annotations=[dict(
        x=mean_overpayment,
        y=0.9,
        yref='paper',
        showarrow=True,
        arrowhead=7,
        ax=0,
        ay=-40,
        text=f"Mean: {mean_overpayment:.2f}"
    )]
)

fig.show()

That is, the expected overpayment conditional on winning the auction is about \$50, or 5% of the item’s actual value! What is surprising is that the winner’s curse arises even though the individual estimates of the item’s value are unbiased.

In auctions, the winner’s curse is very difficult to get rid of, as it depends not just on your own estimate of the item’s value but also on everyone else’s. It also depends on whether the auction is first- or second-price.

For sports betting, consider the following: if your model disagrees with the average betting odds, which one is more likely to be wrong? The only protection is extensive validation: backtest your strategy and paper-trade before spending real money. More on backtesting later.

TrueSkill

TrueSkill is a Bayesian skill rating system developed by Microsoft for multiplayer games. It aims to estimate the “true skill” of each player or team based on their performance history. The model uses a Gaussian distribution to represent the skill level of each player, and it updates these skill levels after each match using Bayesian inference. TrueSkill considers not just the outcome but also the uncertainty around each player’s skill, providing a more dynamic and accurate measure of player ability.

Inferential vs predictive models

If we have TrueSkill, why do we even need a ML model? TrueSkill serves as an inferential model aimed at understanding the underlying skill levels of players or teams based on their historical performance. Its primary goal is not to predict future match outcomes but to estimate parameters that describe the skill and uncertainty levels of each player. While it provides valuable latent variables, such as skill ratings, it doesn’t necessarily consider all the variables that might influence the outcome of a future match, such as current form, injuries, or team strategies.

On the other hand, machine learning models in this context are predictive. Their primary goal is to accurately predict the outcome of future matches. These models have the advantage of being able to incorporate a wide range of features, including but not limited to those offered by TrueSkill. By doing so, they can capture complex, non-linear relationships and interactions between variables, offering potentially higher prediction accuracy.

So why not use TrueSkill output directly for betting on e-sport matches? There are several reasons. First, TrueSkill’s scope is limited to estimating the “true skill” of players, which, although important, is not the only factor affecting match outcomes. Second, machine learning models can capture non-linearities and interactions between features, something that a simpler model like TrueSkill is not designed to do. Third, using ensemble methods, you can combine TrueSkill estimates with other predictive models to potentially improve accuracy. Lastly, machine learning models can be more adaptable to new data or changes in the game, making them more flexible for prediction tasks.

Part 2: Solution (Heavy WiP)

Summary:

  • Web scraping using Selenium and Beautiful Soup
  • Collected over 30k CS:GO matches
  • Collected on average 30 betting odds per match
  • Feature engineering: 100s of features
  • Also used the TrueSkill model as a feature
  • Modelling with LightGBM
  • Out-of-time evaluation with AUC and calibration metrics
  • Backtesting to calculate the ROI

Web scraping

Data is the new oil.

Feature engineering

df = pd.read_parquet("dataset.parquet")

Modelling

Train-test split

import numpy as np
import pandas as pd

dataset = pd.read_parquet("dataset.parquet").drop_duplicates()

dt_train = '2019-01-01'
dt_test = '2019-08-01'

dataset['target'] = (dataset['winner'] == 'team1').astype(bool)
dataset = dataset[
    (dataset['match_date'] >= '2017-01-01') &
    (dataset['winner'] != 'tie') &
    (dataset['match_id'] != 'https://www.hltv.org/matches/2332976/lucid-dream-vs-alpha-red-esl-pro-league-season-9-asia')
].reset_index(drop=True)

mask_train = dataset['match_date'] < dt_train
dataset_train = dataset.loc[mask_train].reset_index(drop=True)

dataset_train2 = dataset_train.sample(frac=1).reset_index(drop=True)
dataset_train2['target'] = ~dataset_train2['target']
cols = []
for c in list(dataset_train.columns):
    if c.startswith('team1_'):
        cols.append(c.replace('team1_', 'team2_').replace('_team2', '_team1'))
    elif c.startswith('team2_'):
        cols.append(c.replace('team2_', 'team1_').replace('_team1', '_team2'))
    else:
        cols.append(c)
dataset_train2 = dataset_train2.rename(columns=dict(zip(dataset_train.columns, cols)))
dataset_train = dataset_train[cols]
dataset_train2 = dataset_train2[cols]
dataset_train = pd.concat([dataset_train, dataset_train2], axis=0, ignore_index=True).reset_index(drop=True)

idxs = np.random.choice(len(dataset_train), replace=False, size=4000)
dataset_val = dataset_train.loc[idxs].drop_duplicates('match_id').reset_index(drop=True)
dataset_val = dataset_val.reset_index(drop=True)

index = np.arange(len(dataset_train))
mask = ~np.in1d(index, idxs)
dataset_train = dataset_train.loc[mask].reset_index(drop=True)

mask_test = (
    (dataset['match_date'] >= dt_train) &
    (dataset['match_date'] < dt_test)
)
dataset_test = dataset.loc[mask_test].reset_index(drop=True)
dataset_train.shape, dataset_val.shape, dataset_test.shape
((38490, 267), (3824, 267), (4759, 267))

Model: LightGBM

We used a standard off-the-shelf LightGBM binary classifier. There are many advantages to using LightGBM or XGBoost for tabular data problems:

  • Handles missing values natively
  • Handles categorical features natively
  • Early stopping to set the number of estimators
  • Blazing fast and scalable
  • A choice of loss functions, including custom ones
    • For binary classification, the default is the binary log loss (a proper scoring rule)

For more information on how to unlock the power of LightGBM, watch my PyData London 2022 presentation.

We do one special thing: since the training data contains each match twice, with team1 and team2 swapped, at prediction time we average two predictions: one with the original feature order and one with the teams swapped.

class CSGOPredictor(object):
    def __init__(self, model_params):
        self.model_params = model_params

    def fit(self, x_train, y_train, x_val, y_val):
        self.lgb = LGBMClassifier(**self.model_params).fit(
            x_train, y_train,
            eval_names=['training', 'validation'],
            eval_set=[(x_train, y_train), (x_val, y_val)],
            early_stopping_rounds=50,
            verbose=50,
        )
        return self

    def predict_proba(self, x):
        original = self.lgb.predict_proba(x)

        # Build a mirrored copy of the features with team1 and team2 swapped
        x_inv = x.copy()
        team1_cols = [i for i in x_inv.columns if i.startswith('team1')]
        team2_cols = [i for i in x_inv.columns if i.startswith('team2')]

        x_inv = x_inv.rename(dict(zip(team1_cols + team2_cols, team2_cols + team1_cols)), axis=1)
        x_inv = x_inv.reindex(columns=x.columns)

        inv = self.lgb.predict_proba(x_inv)

        # The mirrored prediction targets the opposite outcome, so swap its columns back
        inv[:, 0], inv[:, 1] = inv[:, 1], inv[:, 0].copy()

        # Average the original and mirrored predictions
        return (original + inv) / 2.0
    
    def predict(self, x):
        return self.predict_proba(x).argmax(axis=1)
Code
drop_cols = ['winner',  'match_date', 'match_id', 'event_id', 'team1_id', 'team2_id', 'target']

x_train = dataset_train.drop(columns=drop_cols, axis=1)
y_train = dataset_train['target']
features = list(x_train.columns)

x_val = dataset_val[features]
y_val = dataset_val['target']

x_test = dataset_test[features]
y_test = dataset_test['target']

model_params = {
    'n_estimators': 10_000,
    'learning_rate': 0.05
}

model = CSGOPredictor(model_params).fit(x_train, y_train, x_val, y_val)
/Users/pedrotabacof/anaconda3/lib/python3.11/site-packages/lightgbm/sklearn.py:726: UserWarning:

'early_stopping_rounds' argument is deprecated and will be removed in a future release of LightGBM. Pass 'early_stopping()' callback via 'callbacks' argument instead.

/Users/pedrotabacof/anaconda3/lib/python3.11/site-packages/lightgbm/sklearn.py:736: UserWarning:

'verbose' argument is deprecated and will be removed in a future release of LightGBM. Pass 'log_evaluation()' callback via 'callbacks' argument instead.
[50]    training's binary_logloss: 0.563087 validation's binary_logloss: 0.587734
[100]   training's binary_logloss: 0.535987 validation's binary_logloss: 0.584062

Feature importance

While typically I’d recommend using SHAP, for simplicity I just plot LightGBM’s standard split importance (how often each feature is used in a split):

Code
# Get feature importances and create a DataFrame
feature_importances = model.lgb.feature_importances_

# Select the top_n importances
top_n = 20
sorted_idx = feature_importances.argsort()[-top_n:][::-1]  # Sort and reverse to get top features first
top_features = np.array(features)[sorted_idx]
top_importances = feature_importances[sorted_idx]

# Create the bar chart
fig = go.Figure(go.Bar(
    x=top_importances,
    y=top_features,
    orientation='h'
))

# Update layout for a more informative plot
fig.update_layout(
    title='Top 20 Feature Importances',
    xaxis_title='Importance',
    yaxis_title='Feature',
    yaxis=dict(autorange='reversed'),
    height=700,
    width=800,
)

fig.show()

Evaluation

We evaluate using the following metrics:

  • Accuracy: how many bets you expect to get right
  • AUC: how well you rank-order the winners/losers
  • Brier score: a metric that takes both calibration and accuracy into account

I also plot the calibration curves for the training and test sets.

from sklearn.metrics import accuracy_score, roc_auc_score, brier_score_loss

# Function to calculate metrics
def calculate_metrics(X, y, model):
    y_pred_proba = model.predict_proba(X)[:, 1]  # Probability of the positive class
    y_pred = model.predict(X)
    return {
        'Accuracy': accuracy_score(y, y_pred),
        'AUC': roc_auc_score(y, y_pred_proba),
        'Brier_score': brier_score_loss(y, y_pred_proba)
    }

# Calculate metrics for each set
metrics_train = calculate_metrics(x_train, y_train, model)
metrics_val = calculate_metrics(x_val, y_val, model)
metrics_test = calculate_metrics(x_test, y_test, model)

# Create a DataFrame to display as a table
metrics_df = pd.DataFrame([metrics_train, metrics_val, metrics_test],
                          index=['Training', 'Validation', 'Test'])
metrics_df
            Accuracy       AUC  Brier_score
Training    0.725227  0.805531     0.181373
Validation  0.705021  0.779005     0.191575
Test        0.700357  0.772336     0.191716
Code
from sklearn.calibration import calibration_curve

def plot_calibration_curve(y_true, y_pred_proba, set_name, fig, color):
    # Note: calibration_curve returns (fraction_of_positives, mean_predicted_value), in that order
    fraction_of_positives, mean_predicted_value = calibration_curve(y_true, y_pred_proba, n_bins=10)
    fig.add_trace(go.Scatter(
        x=mean_predicted_value, y=fraction_of_positives,
        mode='lines+markers', name=f'{set_name} set',
        line=dict(color=color)
    ))

# Create a new figure for the calibration plot
calibration_fig = go.Figure()

# Add the perfectly calibrated line
calibration_fig.add_trace(go.Scatter(
    x=[0, 1], y=[0, 1],
    mode='lines', name='Perfectly calibrated',
    line=dict(dash='dot')
))

# Plot calibration curve for the training set
plot_calibration_curve(y_train, model.predict_proba(x_train)[:, 1], 'Training', calibration_fig, 'blue')

# Plot calibration curve for the test set
plot_calibration_curve(y_test, model.predict_proba(x_test)[:, 1], 'Test', calibration_fig, 'red')

# Set layout properties for the calibration plot
calibration_fig.update_layout(
    title="Calibration plot",
    xaxis_title="Mean predicted value",
    yaxis_title="Fraction of positives",
    xaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    yaxis=dict(tickvals=[i/10 for i in range(11)], range=[0, 1]),
    showlegend=True
)

calibration_fig.show()
Code
def auc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store AUC for each week
    weekly_auc = {}

    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
            weekly_auc[week_start_date] = auc

    return pd.Series(weekly_auc)

def acc_over_time(df, model, date_col, target_col, features):
    # Make a copy to avoid modifying the original dataframe and convert match_date to datetime
    weekly_df = df.copy()
    weekly_df[date_col] = pd.to_datetime(weekly_df[date_col])

    # Create a 'week_start_date' column for grouping that represents the start of the week
    weekly_df['week_start_date'] = weekly_df[date_col].dt.to_period('W').apply(lambda r: r.start_time)

    # Initialize a dictionary to store accuracy for each week
    weekly_acc = {}

    for week_start_date, group in weekly_df.groupby('week_start_date'):
        if not group.empty:
            X = group[features]
            y = group[target_col]
            acc = accuracy_score(y, model.predict(X))
            weekly_acc[week_start_date] = acc

    return pd.Series(weekly_acc)
Code
# Calculate weekly AUC for training and test sets
weekly_auc_train = auc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_auc_test = auc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the AUC over time using Plotly
trace0 = go.Scatter(
    x=weekly_auc_train.index,
    y=weekly_auc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)

trace1 = go.Scatter(
    x=weekly_auc_test.index,
    y=weekly_auc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)

layout = go.Layout(
    title='AUC Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='AUC'),
    showlegend=True
)

fig = go.Figure(data=[trace0, trace1], layout=layout)

fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_auc = weekly_auc_train.mean()
avg_test_auc = weekly_auc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_auc_train.index.min(), y0=avg_train_auc,
              x1=weekly_auc_train.index.max(), y1=avg_train_auc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_auc_test.index.min(), y0=avg_test_auc,
              x1=weekly_auc_test.index.max(), y1=avg_test_auc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_auc_train.index.max(), y=avg_train_auc,
                   text=f"Train Avg: {avg_train_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_auc_test.index.max(), y=avg_test_auc,
                   text=f"Test Avg: {avg_test_auc:.2f}", showarrow=False, yshift=10, bgcolor="white")


fig.show()
Code
# Calculate weekly accuracy for training and test sets
weekly_acc_train = acc_over_time(dataset_train, model, 'match_date', 'target', features)
weekly_acc_test = acc_over_time(dataset_test, model, 'match_date', 'target', features)

# Plotting the accuracy over time using Plotly
trace0 = go.Scatter(
    x=weekly_acc_train.index,
    y=weekly_acc_train.values,
    mode='lines+markers',
    name='Training Set',
    line=dict(color='blue')
)

trace1 = go.Scatter(
    x=weekly_acc_test.index,
    y=weekly_acc_test.values,
    mode='lines+markers',
    name='Test Set',
    line=dict(color='red')
)

layout = go.Layout(
    title='Accuracy Over Time',
    xaxis=dict(title='Week Start Date'),
    yaxis=dict(title='Accuracy'),
    showlegend=True
)

fig = go.Figure(data=[trace0, trace1], layout=layout)

fig.add_hline(y=0.5, line_dash="dash", line_color="black",
              annotation_text="Random prediction", annotation_position="bottom right")

avg_train_acc = weekly_acc_train.mean()
avg_test_acc = weekly_acc_test.mean()

# Training set average line for the training period
fig.add_shape(type='line',
              x0=weekly_acc_train.index.min(), y0=avg_train_acc,
              x1=weekly_acc_train.index.max(), y1=avg_train_acc,
              line=dict(dash='dash', color='blue', width=2),
              xref='x', yref='y')

# Test set average line for the test period
fig.add_shape(type='line',
              x0=weekly_acc_test.index.min(), y0=avg_test_acc,
              x1=weekly_acc_test.index.max(), y1=avg_test_acc,
              line=dict(dash='dash', color='red', width=2),
              xref='x', yref='y')

# Add annotations for the averages
fig.add_annotation(x=weekly_acc_train.index.max(), y=avg_train_acc,
                   text=f"Train Avg: {avg_train_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")
fig.add_annotation(x=weekly_acc_test.index.max(), y=avg_test_acc,
                   text=f"Test Avg: {avg_test_acc:.2f}", showarrow=False, yshift=10, bgcolor="white")


fig.show()

Backtesting

Past performance is no guarantee of future results.

Backtesting is all about replaying the past with your model’s decisions:

  1. Train the model with data up to a certain date
  2. Make bets for the following day or week
  3. Repeat (1) and (2) until you cover all the test data
  4. Evaluate ML metrics (e.g. AUC) and business metrics (e.g. ROI) on your bets
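That loop can be sketched generically. The model factory, column names, and weekly frequency here are placeholders, and our actual backtest below takes a simplified shortcut with a single trained model:

```python
import pandas as pd

def walk_forward_backtest(df, make_model, features, target="target", freq="7D"):
    """Repeatedly retrain on all past matches and predict the next window."""
    results = []
    step = pd.Timedelta(freq)
    dates = pd.date_range(df["match_date"].min(), df["match_date"].max(), freq=freq)
    for start in dates[1:]:
        train = df[df["match_date"] < start]                                    # Only the past
        test = df[(df["match_date"] >= start) & (df["match_date"] < start + step)]  # Next window
        if train.empty or test.empty:
            continue
        model = make_model().fit(train[features], train[target])
        results.append(test.assign(proba=model.predict_proba(test[features])[:, 1]))
    return pd.concat(results, ignore_index=True)
```

ML and business metrics are then computed over the concatenated out-of-sample predictions only.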

dataset_with_odds = pd.read_parquet("match_predictions_with_odds.parquet")
dataset_with_odds = dataset_with_odds[["match_id", "team1_odds", "team2_odds"]]
dataset_with_odds = dataset_with_odds.merge(dataset_test, on="match_id")
dataset_with_odds['match_date'] = pd.to_datetime(dataset_with_odds['match_date'])
dataset_with_odds = dataset_with_odds.sort_values(by='match_date')

dataset_with_odds.shape, dataset_test.shape
((1113837, 269), (4759, 267))
MIN_PROBA = 0.5
MIN_DELTA_PROBA = 0.01
N_SIMS = 100

all_samples_data = []  # List to store data from all samples for later aggregation

for _ in range(N_SIMS):
    df = dataset_with_odds.groupby('match_id').apply(lambda x: x.sample(1)).reset_index(drop=True)
    predict_proba = model.predict_proba(df[features])
    df['team1_proba'] = predict_proba[:, 1]
    df['team2_proba'] = predict_proba[:, 0]
    df["team1_implied_prob"] = 1 / df["team1_odds"]
    df["team2_implied_prob"] = 1 / df["team2_odds"]
    df["team1_bet"] = (df.team1_proba > MIN_PROBA) & (df.team1_proba > (df.team1_implied_prob + MIN_DELTA_PROBA))
    df["team2_bet"] = (df.team2_proba > MIN_PROBA) & (df.team2_proba > (df.team2_implied_prob + MIN_DELTA_PROBA))
    df["team1_returns"] = np.where(df.team1_bet & (df.winner=='team1'), df["team1_odds"], 0.0)
    df["team2_returns"] = np.where(df.team2_bet & (df.winner=='team2'), df["team2_odds"], 0.0)
    df["loss"] = df["team1_bet"].astype(int) + df["team2_bet"].astype(int)  # total stake: 1 unit per bet
    df["revenue"] = df["team1_returns"] + df["team2_returns"]
    df["profit"] = df["revenue"] - df["loss"]
    all_samples_data.append(df)
all_samples_df = pd.concat(all_samples_data).reset_index(drop=True)
all_samples_df['match_date'] = pd.to_datetime(all_samples_df['match_date'])
all_samples_df.sort_values(by='match_date', inplace=True)
all_samples_df['cumulative_profit'] = all_samples_df['profit'].cumsum()  # pooled across simulations

daily_profit_sum = all_samples_df.groupby('match_date')['profit'].sum().reset_index()
daily_profit_sum['cumulative_profit'] = daily_profit_sum['profit'].cumsum()/N_SIMS

total_profits = all_samples_df['profit'].sum()
total_bets = all_samples_df['loss'].sum()  # 'loss' stakes 1 unit per bet, so its sum is the total amount wagered
roi = total_profits / total_bets if total_bets > 0 else 0

roi
0.06400430432198898
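Because each simulation samples a different odds line per match, the backtest ROI is itself a random variable, so it is worth looking at its spread across simulations rather than only the pooled point estimate. A minimal sketch of that idea on toy numbers (the odds range and the 3-point edge are assumptions for illustration, not the post's actual figures):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_roi(n_bets=500):
    # Toy bets: decimal odds around 2.0 with an assumed 3-point edge over the implied probability
    odds = rng.uniform(1.5, 2.5, size=n_bets)
    win_prob = 1.0 / odds + 0.03
    wins = rng.random(n_bets) < win_prob
    revenue = np.where(wins, odds, 0.0).sum()
    stake = float(n_bets)  # 1 unit per bet, mirroring the backtest above
    return (revenue - stake) / stake

rois = np.array([simulate_roi() for _ in range(100)])
low, high = np.quantile(rois, [0.05, 0.95])  # spread of ROI across simulations
```

If the 5th percentile is comfortably above zero, the edge is more believable than a single pooled ROI number suggests.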
# Create a Plotly figure
fig = go.Figure()

# Add traces for each sample's cumulative profits
for sample_data in all_samples_data:
    # Make sure to sort the sample_data by 'match_date'
    sample_data_sorted = sample_data.sort_values(by='match_date')
    fig.add_trace(go.Scatter(
        x=sample_data_sorted['match_date'],
        y=sample_data_sorted['profit'].cumsum(),
        mode='lines',
        line=dict(width=1, color='lightgrey'),
        showlegend=False
    ))

# Add a trace for the average cumulative profits per date
fig.add_trace(go.Scatter(
    x=daily_profit_sum['match_date'],
    y=daily_profit_sum['cumulative_profit'],
    mode='lines',
    name='Avg Cum. Profits',
    line=dict(width=3, color='blue')
))

# Adding ROI text
fig.add_trace(go.Scatter(
    x=[daily_profit_sum['match_date'].iloc[-1] + pd.DateOffset(days=4)],
    y=[daily_profit_sum['cumulative_profit'].iloc[-1]],
    text=[f"ROI: {roi:.2f}"],  # The ROI text
    mode="text",
    showlegend=False,
    textfont=dict(  # Adjust the font properties here
        size=14,
        color='black',
    )
))

# Update layout to add titles and make it more informative
fig.update_layout(
    title="Cumulative Profits over Time with Average",
    xaxis_title="Match Date",
    yaxis_title="Cumulative Profit",
    legend_title="Legend",
    template="plotly_white",
    xaxis=dict(
        type='date'  # Ensure that x-axis is treated as date
    )
)

# Show the figure
fig.show()

Why did I lose money after all?

- No risk management
- No entry strategy
- No exit strategy
- “ML is all you need”
- Emotions / vibes instead of rationality / planning

Conclusion

- Don’t believe anyone who sells you how to trade or bet (with ML or otherwise)
- Make sure you have an edge and be rigorous about understanding it
- If you use predict_proba, make sure the probabilities are calibrated
- Focus on backtesting, validation and risk management
- Whenever you bid for something, beware the winner’s curse
- Making money directly with ML is hard, so just find an ML job instead!
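On the risk-management point, one standard tool is the Kelly criterion, which sizes bets to maximize expected log wealth (the logarithmic utility of footnote 2). A sketch for a single binary bet; this is the textbook formula, not the sizing rule we actually used:

```python
def kelly_fraction(p, decimal_odds):
    """Fraction of bankroll to stake that maximizes expected log wealth.

    p: your (calibrated) probability of winning
    decimal_odds: payout per unit staked, stake included
    """
    b = decimal_odds - 1.0            # net odds received on a win
    f = (p * b - (1.0 - p)) / b
    return max(f, 0.0)                # never bet when the edge is negative

kelly_fraction(0.55, 2.0)  # → 0.1 (stake 10% of the bankroll)
```

In practice most bettors use fractional Kelly (e.g. half the computed stake), since Kelly is very sensitive to errors in `p`, which is another reason calibration matters.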

Footnotes

  1. For most businesses, what you really want to maximize is the net present value (NPV), which is the sum of cash flows over time, discounted back to their value in present terms. In other words, you care more about profits tomorrow than 10 years from now, but you just don’t want to ignore future profits. For simplicity, we will stick to present-day profits for now.↩︎

  2. That implies your utility is logarithmic, which is a reasonable assumption for most people. That is, if your wealth is \$100, you care a lot about winning or losing \$10. But if you are a millionaire, you would only start to worry about winning or losing hundreds of thousands of dollars.↩︎

  3. First-price auctions: the winner pays the amount they bid. Second-price auctions: the winner pays the amount the second-highest bidder bid. Second-price auctions are interesting because the optimal bid is the expected value of the product, so they are adopted in some situations e.g. Meta’s digital ad exchange.↩︎